Review

  • Tools
    • Ch13. Linear Factor Models
    • Ch14. Autoencoders
  • Overviews(?)
    • Ch15. Representation Learning
  • Specific Issues
    • Ch16. Structured Probabilistic Models for Deep Learning
    • Ch17. Monte Carlo Methods
    • Ch18. Confronting the Partition Function
    • Ch19. Approximate Inference
    • Ch20. Deep Generative Models

Contents

  • Introduction
    • Definition of Representation
  • Greedy Layer-Wise Unsupervised Pretraining
    • When and Why Does Unsupervised Pretraining Work?
  • Transfer Learning and Domain Adaptation
    • Use shared representation
  • Semi-Supervised Disentangling of Causal Factors
    • Use information from unsupervised tasks to perform supervised task
  • Distributed Representation
  • Exponential Gains from Depth
    • Deep representation
  • Providing Clues to Discover Underlying Causes

Introduction

Representation

  • Arabic numeral representation VS Roman numeral representation
    • 210 / 6 VS CCX / VI
  • Better representation in Machine Learning
    • Good one makes a subsequent task easier
  • Almost all learning algorithms learn "representations" in a deep architecture
    • Supervised/Unsupervised Learning learns "implicitly" as side effects
    • Some algorithms designed explicitly for Representation Learning
      • e.g. Distribution Learning (Density Estimation)
  • Tradeoff Issue
    • Preserving much information VS Nice properties (e.g. Independence)

Use Unlabeled Data for a good representation

  • Unsupervised Learning
  • Semi-supervised learning

15.1 Greedy Layer-Wise Unsupervised Pretraining

Reference image: https://wikidocs.net/images/page/3413/glw.png (source: https://wikidocs.net/3413)

  • Greedy Layer-Wise?

    • Optimizes one layer at a time rather than jointly optimizing all pieces
  • Use single-layer representation learning algorithm

    • RBM, single-layer autoencoder, sparse coding model (Ch13/14)
    • Take the output of the previous layer
    • Produce a new simpler representation

  • Good initialization for a joint learning procedure over all the layers of a deep neural net for a supervised task (a minimal sketch follows this list)
  • Used to successfully train "even" fully connected architectures
  • Fine tuning after pretraining
    • Optimizes all layers together
    • Can be done in the pretraining phase (pretraining & fine-tuning simultaneously)
  • Can be viewed as a regularizer in supervised learning task
  • Overall training scheme is nearly the same
    • learning algorithms, model types can differ
  • Initialization for unsupervised learning algorithms for...
    • Deep autoencoders
    • Probabilistic models with many layers of latent variables
    • Deep Generative Models (Ch20)
      • Deep belief networks
      • Deep Boltzmann machines
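
A minimal sketch of the greedy layer-wise procedure above, assuming PyTorch, single-layer autoencoders as the per-layer learner, and made-up layer sizes (the chapter's procedure equally applies to RBMs or sparse coding as the single-layer algorithm):

```python
import torch
import torch.nn as nn

layer_dims = [784, 256, 64]            # hypothetical input and hidden sizes
x = torch.randn(128, layer_dims[0])    # stand-in for unlabeled training data

encoders, h = [], x
for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:]):
    enc = nn.Linear(d_in, d_out)
    dec = nn.Linear(d_out, d_in)       # decoder is only used during pretraining
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(100):               # greedy step: optimize this layer alone
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(torch.relu(enc(h))), h)
        loss.backward()
        opt.step()
    encoders.append(enc)
    h = torch.relu(enc(h)).detach()    # simpler representation fed to the next layer

# Fine-tuning: stack the pretrained encoders, add a supervised head, train jointly
model = nn.Sequential(*(nn.Sequential(e, nn.ReLU()) for e in encoders),
                      nn.Linear(layer_dims[-1], 10))
```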

15.1.1 When and Why Does Unsupervised Pretraining Work?

  • History
    • Substantial improvements in test error for "Classification Tasks"
      • Revival of deep neural networks (2006, Hinton)
    • Harmful on many other tasks
    • Ma, J. (2015, Deep neural nets as a method for quantitative structure) found...
      • Significantly helpful for many tasks
      • Slightly harmful on average
    • So we should know "When and Why pretraining works" for a particular task
  • 2 Intuitions

    • Act as regularizer
      • e.g. Optimize only the higher layers (classifier) while freezing the lower layers (feature extractor)
      • Prevent overfitting
      • Improve test set error
      • Speed up optimization
    • Some features that are useful for the unsupervised task may also be useful for the supervised learning task
      • After extracting wheels, we can classify cars and motorcycles by counting wheels
  • Expected Values

    • More effective when the initial representation is poor
      • dimension reduction + manifold learning (Ch14)
      • e.g. word embeddings: learned embeddings give a meaningful similarity between words, unlike one-hot codes
    • Use unlabeled data when labeled data is very scarce (Semi-supervised learning)
    • Regularization for complicated functions
  • Why it works
    • Reduces the variance of the estimation process
      • Figure 15.1 explanation
        • Input-output projection for visualization
        • various starting points (initializations)
        • blue -> red: timeline, from the origin outward
        • trajectories with pretraining converge to a smaller region

  • Comparison to other ways
    • Two "separate" phases
      • More hyperparameters => time-consuming
    • => one phase pretraining
      • Unsupervised learning and supervised learning simultaneously
      • Attach unsupervised learning term to objective function
  • Two phase VS one phase

    • many hyperparameters vs a single hyperparameter
    • several trial-and-error iterations vs one shot
    • no direct control over the regularization strength vs control via the coefficient on the unsupervised cost term (see the sketch after this list)
  • The popularity of unsupervised pretraining has declined

    • Still popular in NLP(natural language processing)
    • Networks regularized with dropout or batch normalization for classification
      • outperform pretrained versions even on medium-sized datasets
    • Bayesian methods outperform on small datasets
  • Nevertheless unsupervised pretraining...
    • an important milestone in the history of deep learning research
    • continues to influence contemporary approaches
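
A minimal sketch of the one-phase alternative mentioned above: the unsupervised (here, reconstruction) cost is added to the supervised objective and its strength is controlled by a single coefficient. PyTorch, the sizes, and the name lambda_unsup are illustrative assumptions, not the book's exact recipe.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)      # unsupervised (reconstruction) head
classifier = nn.Linear(64, 10)    # supervised head
params = (list(encoder.parameters()) + list(decoder.parameters())
          + list(classifier.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

lambda_unsup = 0.5                # single knob controlling regularization strength
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))

h = encoder(x)
loss = nn.functional.cross_entropy(classifier(h), y) \
     + lambda_unsup * nn.functional.mse_loss(decoder(h), x)
opt.zero_grad()
loss.backward()
opt.step()
```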

15.2 Transfer Learning and Domain Adaptation

  • One example problem of transfer learning

    • How to reuse a feature extractor trained on Zebra vs Horse for the classification of Dalmatian vs Dog (a minimal sketch follows this list)
  • In transfer learning, the learner must perform two or more different tasks

    • e.g. Learn on significantly more data (P1), apply the learned transformation on P2(Small data)
  • Sharing layers

    • Share lower layers (underlying factors appear in low-level features) => Multi-task learning
      • e.g. visual categories
        • low-level notions of "edges" and "visual shapes" (corners, circles)
    • Share higher layers (e.g. speech recognition) => Domain Adaptation
  • Domain Adaptation (Sharing Higher Layer)

    • Same task -> Different distribution $P$
    • e.g. Learning positive/Negative sentiment
      • Task1: about Music, Task2: about Movies
      • Why?: vocabulary and style vary from one domain to another
  • Concept Drift
    • Gradual changes in the data distribution over time
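
A minimal sketch of sharing lower layers for transfer, assuming PyTorch and made-up sizes: a feature extractor pretrained on the data-rich task P1 is frozen, and only a new output layer is trained on the small task P2.

```python
import torch
import torch.nn as nn

# Lower layers: generic factors (edges, shapes); assume they were trained on P1
feature_extractor = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                                  nn.Linear(256, 64), nn.ReLU())

for p in feature_extractor.parameters():
    p.requires_grad = False          # reuse the shared representation as-is

new_head = nn.Linear(64, 2)          # task-specific layer for the small task P2
opt = torch.optim.Adam(new_head.parameters(), lr=1e-3)

x_small, y_small = torch.randn(16, 784), torch.randint(0, 2, (16,))
logits = new_head(feature_extractor(x_small))
loss = nn.functional.cross_entropy(logits, y_small)
opt.zero_grad()
loss.backward()
opt.step()
```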

While the phrase "multi-task learning" typically refers to supervised learning tasks, the more general notion of transfer learning is applicable to unsupervised learning and reinforcement learning as well.

  • Same representation may be useful in both settings

    • e.g. Transfer learning competition
      • Mesnil, G. et al. (2011), Unsupervised and transfer learning challenge: a deep learning approach
      • 1st: Learn on $P_1$
      • 2nd: Apply the learned transformation to $P_2$
      • Result
        • deeper representations => faster learning $P_2$
  • Two examples: One-shot learning and zero-shot(zero-data) learning

    • Extreme forms of transfer learning
    • One-shot: One example in the 2nd stage
      • e.g.
        • learn "wheels" from images of bikes n cars
        • learn the one image of a 3-wheel bike
        • test on images of 3-wheel bikes
    • Zero-shot
      • Testing in the 2nd stage without any labeled examples of the new task
      • Learn 2 representations and their relation
      • e.g. Text-Image learning
        • Link text space("4 Legs") - Image space(visual shape of legs and their count)
        • Learn Birds ("2 Legs", "No Ears"), Dogs ("4 Legs", "Round Ears")
        • Input: Text about Cats (4 Legs, Pointy ears)
        • Apply to the images of Cats
      • e.g. Machine translation
        • We can translate sentences even when some words have no labeled translation
        • If word X in language A and word Y in language B behave similarly in their respective spaces => likely the same meaning

  • Zero-shot Model

    • $P(y| x, T)$

      • Traditional input $x$
      • Traditional output $y$
      • Additional random variables, Task $T$
      • e.g. $x$ is an image, $y$ is "yes" or "no", $T$ is the question "Is there a cat in this image?"

      If we have a training set containing unsupervised examples of objects that live in the same space as T , we may be able to infer the meaning of unseen instances of T.

      • $T$ should be represented in a way that allows some sort of generalization (sketch below)
        • e.g. generalizing to "Is there an animal in this image?"

15.3 Semi-Supervised Disentangling of Causal Factors

  • Large amount of unlabeled data and relatively little labeled data

  • $P(x)$ is helpful for $P(y|x)$
  • Causal Factor -(Representation)-> Feature

  • Better Representations?

    1. Representation disentangles the causes from one another
    2. Easy to model
      • e.g. Simple model: sparsity, independence
  • Hypothesis motivating semi-supervised learning

    • If (1) and (2) coincide =>
    • If a representation $h$ captures many of the underlying causes of the observed $x$,
      • and the outputs $y$ are among the most "salient" causes, then it is easy to predict $y$ from $h$
      • Model $p(x)$ through $p(x|h)$ and $p(h)$; the uncovered $h$ also makes $p(y|x)$ easy
    • c.f. If $P(x)$ is uniformly distributed => Semi-supervised learning fails
    • Simple example: if $p(x)$ is a mixture with one well-separated component per class $y$, modeling $p(x)$ reveals the components and makes $p(y|x)$ nearly trivial (see the sketch after this list)

  • Issue: Hard to capture the salient factors
    • Two strategies
      1. Use a supervised learning signal (labeled data)
      2. Use a much larger representation
  • Adversarial Framework (CH 20)
    • Modify the definition of which underlying causes are most salient.
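
A minimal sketch of the mixture example above, assuming scikit-learn and synthetic data: model $p(x)$ on unlabeled points, then use a handful of labels to name the discovered components, which makes $p(y|x)$ easy.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Unlabeled x drawn from two well-separated causes (classes)
x_unlabeled = np.vstack([rng.normal(-3, 1, (500, 2)), rng.normal(+3, 1, (500, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(x_unlabeled)

# A few labeled points are enough to attach a class y to each component
x_labeled = np.array([[-3.0, -3.0], [3.0, 3.0]])
y_labeled = np.array([0, 1])
component_to_class = {c: y for c, y in zip(gmm.predict(x_labeled), y_labeled)}

# p(y|x) is now easy: predict the component, then map it to a class
x_test = np.array([[2.5, 3.5]])
print(component_to_class[gmm.predict(x_test)[0]])   # -> 1
```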

15.4 Distributed Representation

  • Symbolic Representation
    • One distinct symbol or category per concept (n concepts -> n symbols or categories)
    • e.g.
      • Red Car, Green Car, Blue Car, Red Truck, Green Truck, Blue Truck, Red Bird, Green Bird, Blue Bird
    • N "Binary" features => One hot representation
      • e.g.
        • Red Car = [1, 0, 0, 0, 0, 0, 0, 0, 0]
        • Blue Bird = [0, 0, 0, 0, 0, 0, 0, 0, 1]
      • Still "Sparse" Representation (CH 1)
  • Distributed Representation?

    • e.g.
      • Red Car = [[1, 0, 0], [1, 0, 0]] ([[Red Bit, Green Bit, Blue Bit],[Car Bit, Truck Bit, Bird Bit]])
      • Blue Bird = [[0, 0, 1], [0, 0, 1]]
    • Not all values are feasible
      • e.g. [1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0] are not feasible
  • One-hot Representation VS Distributed Representation

    • Only one entry can be active VS multiple entries can be active
    • Representation Dimension
      • $n^d$ VS $n\times d$
        • $d:= \text{number of features}$, $n:= \text{number of values per feature}$
        • e.g. $3^2 = 9$ VS $2\times3 = 6$
      • Not Powerful VS Powerful
  • The combination of a powerful representation layer and a weak classifier layer can be a strong regularizer
    • A classifier trying to learn the concept of "person" vs "not a person" does not need to assign a different class to an input represented as "woman with glasses".
      • (person vs not a person), (man vs woman), (with glasses vs without glasses)
    • This capacity constraint encourages each classifier to focus on few $h_i$ and encourages $\mathbf{h}$ to learn to represent the classes in a linearly separable way
      • some classifier focuses (man vs woman), another one focuses (with glasses vs without glasses)
  • Non-distributed Representation
    • Example Type 1: Input point is assigned to exactly one cluster.
      • Clustering methods: K-means algorithm
      • Decision Trees
    • Example Type 2: Entries cannot be controlled separately from each other.
      • K-nearest neighbors algorithms
      • Gaussian Mixtures and Mixtures of Experts
      • Kernel machines with a Gaussian kernel
    • Another example: N-gram language models
      • The set of contexts is partitioned by a tree of suffixes (Ch 12)
  • Distributed Representation
    • Generalization arises due to "shared attributes"
      • "cat" and "dog"
        • "has_fur" or "number_of_legs" have the same value in the embeddings of both
    • Induces a rich "similarity space"
      • Semantically close concepts are close in "Distance"
        • "cat" is closer to "dog" than "snake"
  • When and Why can there be a statistical advantage from using a distributed representation as part of a learning algorithm?

    • When
      • complicated structure can be compactly represented using a small number of parameters (dim vector size)
      • Bigger Dim of parameters -> larger degree of freedom -> larger regions -> larger data
    • Why

      • The Number of Distinguishable Regions using linear threshold units
      • $d:= \text{input dimension}$, $n:= \text{number of features (linear threshold units)}$
        • $\sum_{j=0}^{d} \binom{n}{j} = O(n^{d})$ distinguishable regions using only $O(nd)$ parameters (a small numeric check follows this list)
      • Can be extended to the case using nonlinear units
        • represents more regions with fewer parameters
        • fewer examples needed to generalize well
      • The effective capacity remains limited
        • If $w$ is the number of weights, the VC dimension is $O(w \log w)$
          • VC Dimension (Vapnik-Chervonenkis)?
            • e.g. the VC dimension of a linear classifier in 2D is 3
      • Learning about each of them without having to see all the configurations of all the others
        • e.g. If we learn about man with glasses, man without glasses and woman without glasses
          • we can infer woman with glasses.
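
A small numeric check of the counting argument above, with made-up values of n and d: n linear threshold features over a d-dimensional input distinguish $\sum_{j=0}^{d}\binom{n}{j} = O(n^d)$ regions while using only $O(nd)$ parameters.

```python
# Regions induced by n hyperplanes in general position in R^d (Zaslavsky-style count)
from math import comb

def num_regions(n_features: int, d_input: int) -> int:
    """Number of regions n linear threshold features can distinguish in R^d."""
    return sum(comb(n_features, j) for j in range(d_input + 1))

for n, d in [(5, 2), (10, 2), (10, 3), (20, 3)]:
    print(f"n={n:2d}, d={d}: {num_regions(n, d):5d} regions "
          f"from ~{n * d} parameters")
```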

15.5 Exponential Gains from Depth

  • Functions can be represented by exponentially smaller deep networks compared to shallow networks

    • Small deep networks can represent some functions more compactly than shallow networks
  • e.g. Generative Model

    • Needs to map from the underlying factors to the input in highly nonlinear ways in order to generate data
    • High nonlinearity
      • Composition of many nonlinearities and a hierarchy of reused features can give an exponential boost to statistical efficiency, on top of the exponential boost given by using a Distributed Representation
      • Deep Network + Distributed Representation => High nonlinearity
      • e.g. Simple Universal Approximator (Ch 6)
        • Boolean gates, sums/products, or RBF units, even with a single hidden layer
        • Can approximate a large class of functions
        • Expressive Power
          • Need "exponential" number of hidden units in order to have same expressive power of architecture with additional 1 depth.
      • Similar Result on...
        • Deterministic feedforward networks as universal approximators of "probability distributions"
          • Many structured probabilistic models with a single hidden layer of latent variables (Ch 16)
          • e.g. Boltzmann machines, deep belief networks
          • A deeper one can have an "exponential" advantage over a shallow one
        • sum-product network for probabilistic models (SPN)
        • Deep circuits related to convolutional networks (Convolutional sum-product network)
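
A small, loosely analogous illustration (my own example, not the book's proofs): the parity of n bits is compact for a deep circuit that reuses intermediate XOR features, while a shallow two-layer OR-of-ANDs (DNF) circuit must enumerate one AND term per odd-parity input pattern, i.e. exponentially many.

```python
from functools import reduce
from itertools import product
from operator import xor

def deep_parity(bits):
    """Chain/tree of n-1 XOR gates: reuses intermediate features."""
    return reduce(xor, bits)

def shallow_parity_terms(n):
    """Minterms a two-layer OR-of-ANDs circuit must enumerate: 2**(n-1) of them."""
    return [p for p in product([0, 1], repeat=n) if sum(p) % 2 == 1]

print("deep_parity([1, 0, 1, 1]) =", deep_parity([1, 0, 1, 1]))
for n in (4, 8, 12):
    print(f"n={n:2d}: deep circuit ~{n - 1} XOR gates, "
          f"shallow DNF needs {len(shallow_parity_terms(n))} AND terms")
```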

15.6 Providing Clues to Discover Underlying Causes

  • What makes one representation better than another?

    • One that disentangles the underlying causal factors
    • Helps the learner separate the relevant observed factors from the others
    • Introduce clues that help the learner find these underlying factors and disentangle them from the others
    • Type 1: Supervised learning
      • Provides a very strong clue: a label $\mathbf{y}$
    • Type 2: Use of abundant unlabeled data
      • hints about the underlying factors
        • take the form of implicit prior beliefs that the designers of the learning algorithm impose in order to guide the learner.
        • Regularization strategies are necessary to obtain good generalization
        • One goal of deep learning is to find a set of fairly generic regularization strategies.
  • Generic Regularization Strategies

    • Smoothness
      • $f(x + \epsilon d) \approx f(x)$ for small $\epsilon$ and unit vector $d$
      • allow to generalize from training examples to nearby points in input space
    • Linearity
      • Relationships btw some variables are linear.
      • Makes predictions possible even very far from the observed data,
        • but sometimes leads to overly extreme predictions
      • Simple machine learning algorithms use "Linearity" instead of "Smoothness"
        • Linearity and Smoothness are different assumptions in "high-dimensional" space
    • Multiple Explanatory Factors (Output)
      • Motivates semi-supervised learning, where modeling $p(x)$ helps with $p(y|x)$
      • Distributed Representation
    • Causal Factors (Input)
      • Underlying Causal Factor $h$ in Semi-Supervised Learning
    • Depth or a Hierarchical Organization of Explanatory Factors
      • High level can be defined in terms of simple concepts forming a hierarchy.
        • Cat (High level), pointy ears, 4 legs (Lower level)
      • Multi step program
        • Each step (layer) refers back to the output of the previous step (layer)
    • Shared Factors across Tasks
      • In many tasks: different $\mathbf{y_i}$ outputs share the same $\mathbf{x}$ input.
        • There are task-specific functions $f^{(i)}(\mathbf{x})$ of a global input $\mathbf{x}$
        • Each $\mathbf{y_i}$ is associated with a different subset of $\mathbf{h}$
          • $P(\mathbf{y_i} | \mathbf{x})$ depends on the shared $P(\mathbf{h} | \mathbf{x})$
    • Manifolds
      • Regions in which probability mass concentrates are...
        • locally connected
        • occupy a tiny volume
      • These regions can be approximated by low-dimensional manifolds with much smaller dimensionality than the input space
      • Motivate Autoencoders
    • Natural Clustering
      • Each manifold in the input space may be assigned to a single class.
      • The data may lie on many disconnected manifolds
      • Motivate tangent propagation, double backprop, manifold tangent classifier, adversarial training
    • Temporal and Spatial Coherence
      • The most important explanatory factors change slowly over time (a minimal sketch follows this list)
    • Sparsity
      • Most features should presumably not be relevant to describing most inputs
    • Simplicity of Factor Dependencies
      • The simplest example
        • $P(\mathbf{h}) = \prod_i P(h_i)$
      • Linear dependencies, or dependencies captured by an autoencoder
      • This assumption appears in many laws of physics
      • Motivates using a linear predictor or a factorized prior on top of a learned representation
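
A minimal sketch of the temporal coherence (slowness) clue, assuming PyTorch and a made-up frame sequence: penalize the representation for changing quickly between consecutive frames. In practice this term is combined with other objectives (e.g. reconstruction or variance constraints) so the representation does not collapse to a constant.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

frames = torch.randn(10, 784)              # stand-in for a short video clip
h = encoder(frames)                        # h[t] is the representation of frame t
slowness = (h[1:] - h[:-1]).pow(2).mean()  # features should change slowly over time
opt.zero_grad()
slowness.backward()                        # used alone this collapses; combine with other losses
opt.step()
```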